Method and apparatus for speech recognition which is robust to missing speech data

ABSTRACT

A speech recognizer suitable for distributed speech recognition is robust to missing speech feature vectors. Speech is transmitted via a packet switched network in the form of basic feature vectors. Missing feature vectors are detected and replacement feature vectors are estimated by interpolation of received data prior to speech recognition. Features may be converted and interpolation may be accomplished in a spectral domain.

This application is the U.S. national phase of international application PCT/GB00/04206 filed Nov. 2, 2000 which designated the U.S.

BACKGROUND

1. Technical Field

This invention relates to a method of and an apparatus for speech recognition which is robust to missing speech data. It is particularly useful in distributed speech recognition in which data is transmitted via a packet switched network.

2. Description of Related Art

Recently there has been an enormous increase in the use of mobile devices such as mobile phones and personal digital assistants. It is desirable to make the human to device interface as natural and easy to use as possible. Speech recognition is one solution which increases naturalness, and overcomes the difficulties in using very small keyboards found on many mobile devices. A Personal Computer (PC) usually provides sufficient processing power to operate a speech recogniser. However, on mobile devices processing power is a limiting factor. One solution is to use distributed speech recognition (DSR). DSR makes use of remote speech recognisers which are accessed by a device across a transmission network. Speech data from the device is transmitted across the network to the remote speech recogniser and the remote speech recogniser processes the speech to provide a recognition result (or set of results) which is then transmitted back to the device.

There are basically two types of network across which such information can be transmitted; namely connection-orientated networks and connectionless networks. The connection-orientated network is essentially the telephony service which has evolved over the last 100 years for the switching and transmission of voice data. A connectionless network is packet-based and its main functionality is the routing and switching of data packets from one location to another.

When a call is made on a connection-orientated network a reservation is made to ensure that sufficient network resources are available to sustain the call. This may be the allocation of a physical connection or of time slots in a pulse code modulation (PCM) system. If sufficient resources are not available then the call is refused, typically accompanied by an engaged signal.

The connectionless network is very much aimed at the routing and switching of data packets and is designed to efficiently handle the high burstiness of this traffic. Packets are comprised of two parts: a header and payload. The header contains information regarding the source and destination address while the payload contains the actual data which needs to be sent.

For transmitting real-time data such as speech, the essential difference between the two networks is that the connection-orientated network reserves sufficient capacity, or bandwidth, to maintain a connection throughout the call. With a connectionless network sufficient bandwidth is not guaranteed which means that the network may produce delays or missing packets which interrupt the data transmission. Therefore the connection-orientated network is much better suited to delivering real-time data. Voice has therefore traditionally been transmitted using connection-orientated networks. However, because of the enormous growth in data networks, the technique of Voice over Internet Protocol (VoIP) has been developed to allow the real-time transmission of voice signals across connectionless networks.

In a connectionless network the packets containing the speech can be routed across a wide variety of paths depending on the network traffic. Indeed, it may be that successive packets are routed around the network on different paths. As a result it is possible that some packets arrive out of sequence or may never even arrive. This is clearly undesirable in a DSR system as it will introduce recognition errors. An approach to dealing with this problem of missing packets is to use protocols designed specifically for real-time data which ensure all the data arrives with minimal delay.

The traditional connectionless network is termed best-effort. This means that packets from a source are sent to a destination with no guarantee of a timely delivery. For applications such as file transfer which require a guarantee of delivery, Transmission Control Protocol (TCP) is able to trade packet delay for guaranteed reception. In the event of lost packets TCP allows for the destination to request the retransmission of those lost packets. However, for real-time data it is important to minimise transmission delays. It is therefore impracticable to use TCP and wait for the retransmission of lost packets. A better approach is to use User Datagram Protocol (UDP) as the protocol for sending the packets. This has a short duration buffer which allows for slight delays in packet arrival after which UDP assumes the packet is not going to arrive. No facility for the retransmission of lost packets is available. This has the advantage that delays are minimised but at the expense of possibly losing some of the speech signal when network traffic is high and packet loss is probable.

Protocols designed specifically for real-time data transmission include Resource Reservation Protocol (RSVP). This is a signalling protocol which reserves network resources at the start of a call to ensure that a direct connection to the destination is available throughout. In effect it makes a connection-orientated path from a connectionless network. In order for this to function all the routers in the network from the source to destination must be RSVP enabled. As RSVP is a relatively new protocol not all routers are equipped with this facility.

Another protocol designed specifically for real-time data transmission is DiffServ. This makes use of a byte of data in the packet header to specify a Type of Service (ToS), i.e. how much priority should be given to the immediate routing of that packet through the network. Clearly some data will have very high priority such as network management and system commands. Lower priority will be given to file transfer and email where immediate delivery is not too important. Depending on the emphasis given to the network, high priority can be given to speech packets to assist real-time use. Again, this protocol is only in development and not available generally.

The increase of connectionless voice networks, coupled with the increase in automation of call centres means that the ability to perform robust speech recognition over a connectionless network is becoming more important.

An alternative approach to ensuring that all packets containing the speech signal successfully reach the speech recogniser is to make the recogniser itself robust to missing packets. When the packet loss is low (<5%) the drop in recognition performance is not too significant. However, as packet loss increases, or occurs in bursts, the effect is more detrimental. Therefore, a speech recogniser is required which is able to tolerate this loss of speech.

Known signal processing techniques which deal with missing packets range from very simple to complex; a good review is made in C. Perkins, O. Hodson and V. Hardman, “A survey of packet loss recovery techniques for streaming audio”, IEEE Network Magazine, Vol. 12, No. 5, pp. 40–48, October 1998. Simple techniques include splicing which merely joins the speech signal together either side of the gap. Silence and noise substitution replace the missing frames of speech with either silence or noise. Repetition replaces the lost frames of speech with copies of the speech which arrived before the gap.

More sophisticated techniques attempt to estimate the missing parts of the signal from those parts which have been correctly received. These include waveform substitution which uses the pitch on either side of the gap to estimate the missing speech. Time scale modification stretches the audio signal either side of the gap to fill in the missing speech. Regeneration-based repair uses parameters of the codec to determine the required fill-in speech. All these techniques attempt to reconstruct the time-domain speech signal.

BRIEF SUMMARY OF NON-LIMITING EXEMPLARY EMBODIMENTS

According to the present invention there is provided a method of speech recognition comprising the steps of receiving a sequence of transmitted feature vectors, said feature vectors representing a speech signal; detecting the absence of a feature vector in the received sequence; generating an estimated replacement feature vector for the detected absent feature vector; inserting said replacement feature vector into the received feature vector sequence to provide a modified feature vector sequence; and performing speech recognition upon the modified feature vector sequence.

Preferably the feature vector comprises a plurality of components and the generating step comprises estimating a component of a replacement feature vector by interpolating the corresponding component of a received feature vector.

In a preferred embodiment the estimating step uses an interpolation coefficient corresponding to a component of the received feature vector, and the method further comprises the step of updating the interpolation coefficient in response to one or more received feature vectors.

In an alternative embodiment of the invention the transmitted feature vectors include features in a cepstral domain, and in which the estimating step comprises the sub steps of converting a received feature vector to a spectral domain; estimating a spectral component by interpolating the corresponding component of the converted feature vector; and converting the estimated spectral component to said cepstral domain.

According to another aspect of the invention there is provided a device for performing speech recognition upon a sequence of parameterised feature vectors comprising a missing feature vector detector arranged in operation to receive the transmitted feature vectors and to indicate the absence of a feature vector in the received sequence; a feature vector estimator arranged, in operation, to receive transmitted feature vectors and responsive to said indication from the missing feature vector detector to estimate a replacement feature vector; a sequence reconstructer arranged, in operation, to receive transmitted feature vectors and to receive a replacement feature vector and to provide as an output a modified feature vector sequence; and a speech recogniser arranged, in operation, to receive the modified feature vector sequence.

Preferably the feature vector estimator comprises an interpolator arranged to receive a feature vector and to provide as an output a component of the replacement feature vector.

In a preferred embodiment the interpolator uses an interpolation coefficient corresponding to a component of the received feature vector and the interpolator is arranged to update the interpolation coefficient in response to receipt of a feature vector.

In an alternative embodiment of the invention the feature vector estimator comprises a first converter for converting a received feature vector to a spectral domain; an estimator for estimating a spectral component by interpolating the corresponding component of the converted frame; and a second converter for converting the estimated spectral component to said cepstral domain.

A data carrier loadable into a computer and carrying instructions for causing the computer to carry out a method according to the invention and a data carrier loadable into a computer and carrying instructions for enabling the computer to provide the device according to the invention are also provided.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention will now be described, by way of example only, with reference to the accompanying drawings in which:

FIG. 1 is a schematic representation of a computer loaded with software embodying the present invention;

FIG. 2 is a functional block diagram showing program elements for software embodying a known technique for performing DSR;

FIG. 3 is a functional block diagram showing program elements for software embodying another known technique for performing DSR;

FIG. 4 is a functional block diagram of program elements that comprise a parameteriser of FIGS. 2 and 3;

FIG. 5 is a functional block diagram of the program elements that comprise the software indicated in FIG. 1;

FIG. 6 is a functional block diagram of the program elements which comprise a feature vector regenerator of FIG. 5;

FIG. 7 is a functional block diagram of the program elements that comprise a frame estimator shown in FIG. 6 in one embodiment of the invention; and

FIG. 8 is a functional block diagram of the program elements that comprise the frame estimator shown in FIG. 6 in a second embodiment of the invention.

DETAILED DESCRIPTION OF NON-LIMITING EXEMPLARY EMBODIMENTS

FIG. 1 illustrates a conventional computer 101, such as a PC, running a conventional operating system 103, such as Windows (a Registered Trade Mark of Microsoft Corporation), and having a number of resident application programs 105 including a word processing program, a network browser and e-mail program and a database management program. The computer 101 is also connected to a conventional disc storage unit 111 for storing data and programs, a keyboard 113 and mouse 115 for allowing user input and a printer 117 and display unit 119 for providing output from the computer 101. The computer 101 also has access to external networks (not shown) via a network card 121. The computer 101 also includes a speech recognition program 109 that enables a speech signal received via the network card 121 to be recognised.

In FIG. 2, a mobile device 201 includes a framer 205 which divides a received speech signal into short duration frames, for example 30 ms, and sends the resultant frames to an encoder 202. The encoder 202 encodes each frame of received speech into a suitable coded representation, for example using the standard codec defined in ITU-T G.723.1, and the resultant coded representation is sent to a packetiser 203. The coded representation forms the payload for a packet (not shown) which has a header added by the packetiser 203. The packet is transmitted via a connectionless network 206 to a remote device 204. The remote device 204 includes an unpacketiser 207 which removes the header, and a decoder 208 which decodes the coded representation of the speech frame. Speech frames are sent from the decoder 208 to an audio reconstructor 209 where the speech signal is reconstructed. The speech signal is then parameterised by a parameteriser 210 to form feature vectors suitable for use by a speech recogniser 211. The parameteriser 210 comprises a basic feature extractor 213 and a feature processor 212, the operation of which will be described later.

FIG. 3 shows a system for DSR which avoids encoding and decoding the speech signal by transmitting parameterised speech signals over the network 206. In FIG. 3 a device 201′ includes a basic feature extractor 213′ which parameterises speech signals to form feature vectors. The speech feature vectors are packetised by the packetiser 203 and transmitted via the network 206 to a remote device 204′. At the remote device the features are unpacketised by the unpacketiser 207 and transmitted to the speech recogniser 211 via the feature processor 212. This approach is advantageous over the system shown in FIG. 2. Encoding and decoding the signal causes a degradation in quality of the speech signal. This causes a significant reduction in speech recognition performance. By parameterising the speech signal before transmission across the network there is no resultant loss in speech recognition performance. As encoding and decoding of the speech signal is not required there is a significant saving in computation.

The problem of missing packets needs to be addressed. In this invention reconstructing the transmitted feature vector sequence is performed by detecting missing feature vectors and subsequently estimating corresponding replacement feature vectors.

FIG. 5 shows a remote device 204′ according to the invention, which is implemented on a conventional computer as illustrated in FIG. 1. After received features have been unpacketised by the unpacketiser 207, missing features are restored by a feature vector regenerator 214.

In the embodiment of the invention described here the basic feature vectors used are Mel-frequency cepstral coefficients (MFCCs). MFCCs are generated from a received speech signal as illustrated in FIG. 4. A high emphasis filter 10, normally referred to as a pre-emphasis filter, receives a digitised speech waveform at, for example, a sampling rate of 8 kHz as a sequence of 8-bit numbers and performs a high emphasis filtering process (for example by executing a 1 − 0.95z⁻¹ filter), to increase the amplitude of higher frequencies.
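By way of illustration only, the 1 − 0.95z⁻¹ filter mentioned above corresponds to the difference equation y[n] = x[n] − 0.95·x[n−1]. The following minimal Python sketch shows one possible implementation; the function name `pre_emphasis` and the assumption that the sample preceding the first sample is zero are illustrative choices and are not taken from the description.

```python
def pre_emphasis(samples, coefficient=0.95):
    """Apply y[n] = x[n] - coefficient * x[n-1] to a list of samples."""
    out = []
    previous = 0.0  # assumption: the sample before the first one is zero
    for x in samples:
        out.append(x - coefficient * previous)
        previous = x
    return out


# A constant (zero-frequency) input is strongly attenuated after the first
# sample, while an input alternating at the highest frequency is amplified.
constant = pre_emphasis([1.0, 1.0, 1.0, 1.0])
alternating = pre_emphasis([1.0, -1.0, 1.0, -1.0])
```

In this sketch the steady-state gain is about 0.05 for the constant input and 1.95 for the alternating input, which is the sense in which the amplitude of higher frequencies is increased.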

A sequence of 256 contiguous samples (referred to as a frame in this description) of the filtered signal is windowed by a window processor 11 in which the samples are multiplied by predetermined weighting constants using, in this embodiment of the invention, a Hamming window, to reduce spurious artefacts generated at the edges of the frame. Each frame overlaps with neighbouring frames by 50%, so as to provide one frame every 16 ms.
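The framing arithmetic may be checked with a short sketch: 256 samples at 8 kHz span 32 ms, and a 50% overlap gives a hop of 128 samples, i.e. one frame every 16 ms. The Python below is a minimal illustration only; the names `hamming` and `windowed_frames`, the symmetric Hamming formula with coefficients 0.54 and 0.46, and the choice to discard a trailing partial frame are assumptions rather than details given in the description.

```python
import math

FRAME_LENGTH = 256        # samples per frame, as in the description
HOP = FRAME_LENGTH // 2   # 50% overlap between neighbouring frames
SAMPLE_RATE = 8000        # Hz, the example sampling rate


def hamming(n_points):
    """Symmetric Hamming window weights (assumed conventional form)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]


def windowed_frames(samples):
    """Split samples into overlapping frames and apply the window to each."""
    window = hamming(FRAME_LENGTH)
    frames = []
    # assumption: a trailing partial frame is simply discarded
    for start in range(0, len(samples) - FRAME_LENGTH + 1, HOP):
        frame = samples[start:start + FRAME_LENGTH]
        frames.append([s * w for s, w in zip(frame, window)])
    return frames


frame_period_ms = 1000 * HOP / SAMPLE_RATE
frames = windowed_frames([1.0] * 1024)
```

Here `frame_period_ms` evaluates to 16.0, consistent with the frame rate stated above.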

Each frame of 256 windowed samples is then processed by an MFCC generator 12 to extract an MFCC feature vector comprising nine MFCC's.

The MFCC feature vector is derived by performing a spectral transform, in this embodiment of the invention, a Fast Fourier Transform (FFT), on each frame of a speech signal, to derive a representation of the signal spectrum for each frame of speech. The terms of the spectrum are integrated into a series of broad bands, which are distributed in a ‘mel-frequency’ scale along the frequency axis, to provide nineteen mel-frequency features. These features are referred to as filterbank features in this description. The mel-frequency scale is a perceptually motivated scale, which comprises frequency bands evenly spaced on a linear frequency scale between 0 and 1 kHz, and evenly spaced on a logarithmic frequency scale above 1 kHz. The logarithm of each mel-frequency feature is calculated and then a Discrete Cosine Transform (DCT) is performed to generate an MFCC feature vector for the frame. Features such as the mel-frequency features described above, which represent the frequencies within a signal are referred to as being features in a spectral domain. Features which represent the rate of change of frequencies in a signal, such as the MFCC's described above are referred to as being in a cepstral domain.
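The final two steps described above, taking the logarithm of each filterbank feature and then applying a DCT, may be sketched as follows. This is a minimal illustration under stated assumptions: the orthonormal DCT-II scaling, the natural logarithm, and the names `dct_ii` and `filterbank_to_mfcc` are illustrative choices which the description does not specify, and the FFT and mel-band integration steps are omitted.

```python
import math


def dct_ii(values, n_out):
    """First n_out coefficients of an orthonormal DCT-II (assumed scaling)."""
    n = len(values)
    out = []
    for k in range(n_out):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                    for i, v in enumerate(values))
        out.append(scale * total)
    return out


def filterbank_to_mfcc(filterbank, n_coefficients=9):
    """Log of each filterbank feature followed by a truncated DCT."""
    log_energies = [math.log(e) for e in filterbank]
    return dct_ii(log_energies, n_coefficients)


# A flat spectrum across nineteen bands has all of its information in the
# zeroth coefficient; the remaining coefficients are (numerically) zero.
flat = filterbank_to_mfcc([math.e] * 19)
```

The example illustrates why a small number of low order coefficients can summarise a smooth spectral envelope.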

For MFCC's it is found that the useful information is generally confined to the lower order coefficients, so in this embodiment of the invention nine cepstral coefficients are used.

Before the features are transmitted to the feature processor 212, any missing front end feature vectors are restored by the feature vector regenerator illustrated in FIG. 6.

Estimation of missing feature vectors is a simpler problem than estimation of the original time-domain speech signal. Feature vectors are highly correlated with one another in time, and represent a longer portion of speech than a single digital time-domain sample. In the embodiment of the invention described here 256 time-domain samples are represented by 9 MFCC's. The estimation of 9 MFCC's which are highly correlated with preceding and following MFCC's is much simpler than accurate estimation of 256 samples.

In FIG. 6 a stream of MFCCs is shown with one missing feature vector due to the packet having that feature vector as part of its payload being lost in the network 206. Received MFCCs are first passed into a missing feature vector detector 501 which identifies whether any feature vectors are missing or not. If a missing feature vector is detected a feature vector estimator 502 is used to estimate the missing feature vector. The feature vector sequence is then reconstructed by the sequence reconstructor 503. The resultant reconstructed sequence may then be used for speech recognition in the usual way.

In the embodiment of the invention described here the missing feature vector detector 501 uses feature vector numbering. An additional feature is added to the feature vector by the basic feature extractor 213′; the additional feature indicates the position of each feature vector in the feature vector sequence. At the remote device 204′ the missing feature vector detector 501 checks the feature vector number of each feature vector received and uses this number to detect whether there are any missing feature vectors, and if so how many. When one or more missing feature vectors are detected a signal is sent to the feature vector estimator 502.
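Detection by feature vector numbering may be sketched as below. The sketch assumes, for illustration only, that vectors arrive in order, that the numbering increases by one per vector without wrapping, and that each received item is a pair of (number, feature vector); the names `find_missing` and `detect_gaps` are hypothetical.

```python
def find_missing(previous_number, current_number):
    """Numbers of the vectors lost between two consecutively received ones."""
    return list(range(previous_number + 1, current_number))


def detect_gaps(received):
    """Return one list of missing numbers per gap in the received sequence.

    received: list of (sequence_number, feature_vector) pairs, assumed to be
    in arrival order with strictly increasing sequence numbers.
    """
    gaps = []
    for (prev_no, _), (cur_no, _) in zip(received, received[1:]):
        missing = find_missing(prev_no, cur_no)
        if missing:
            gaps.append(missing)
    return gaps


# Vectors 2, 5 and 6 were lost in transit in this example.
stream = [(0, "v0"), (1, "v1"), (3, "v3"), (4, "v4"), (7, "v7")]
gaps = detect_gaps(stream)
```

The length of each entry in `gaps` gives the number of replacement feature vectors to be estimated for that gap.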

The feature vector estimator 502 uses interpolation to estimate the missing speech features. Each feature is estimated separately and the time series of each is used to form a polynomial which enables missing elements to be estimated. In this embodiment of the invention a simple straight line interpolation is used. A detailed description of interpolation algorithms is provided in S. V. Vaseghi, “Advanced signal processing and digital noise reduction”, John-Wiley, 1996.

FIG. 7 illustrates interpolation of a MFCC feature vector. For each feature in the feature vector a corresponding interpolator 601, 602, . . . 609 is established. As each new feature vector arrives the interpolation coefficients for each feature of the feature vector are updated. When a missing feature vector is detected by the missing feature vector detector 501 an estimate of the missing feature vector is made using the interpolators 601, 602, . . . 609. The estimate of the missing frame is then inserted into the feature vector sequence by the sequence reconstructor 503.
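A straight line interpolation applied separately to each feature may be sketched as follows. The sketch assumes that the feature vectors received immediately before and after the gap are both available when the estimate is made, which implies a short delay; the description does not state how the interpolation coefficients are stored or updated, and the name `interpolate_missing` is hypothetical.

```python
def interpolate_missing(before, after, n_missing):
    """Estimate n_missing vectors on a straight line between two neighbours.

    before, after: the feature vectors received on either side of the gap.
    Each component is interpolated independently of the others.
    """
    estimates = []
    for j in range(1, n_missing + 1):
        t = j / (n_missing + 1)  # fractional position within the gap
        estimates.append([b + t * (a - b) for b, a in zip(before, after)])
    return estimates


# One missing vector lies midway between its neighbours.
single = interpolate_missing([0.0, 10.0], [4.0, 2.0], 1)

# Three missing vectors are spaced evenly across the gap.
burst = interpolate_missing([0.0], [4.0], 3)
```

In this sketch `single` is `[[2.0, 6.0]]` and `burst` is `[[1.0], [2.0], [3.0]]`.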

In another embodiment of the invention the MFCC feature vectors, which are in the cepstral domain, are converted back into the spectral domain so that interpolation is performed on features which represent the frequencies in the original signal. Upon detection of a missing feature vector the interpolator produces an estimate of a filterbank feature vector. The logarithm of this estimate is then taken and a DCT applied to transform the estimate into the MFCC domain. This is illustrated in FIG. 8 in which a sequence of received MFCC feature vectors 701 has an inverse DCT applied to it at 702, and is exponentiated at 703 (i.e. the inverse of a logarithm is applied) to provide a sequence of filterbank feature vectors. A filterbank interpolator 705 is used to provide a filterbank estimate 706 of a missing feature vector, and the filterbank estimate 706 has a logarithm calculated at 707 and a DCT applied at 708 to provide an MFCC estimate 709.
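The chain of FIG. 8 (inverse DCT, exponentiation, interpolation, logarithm, DCT) may be sketched as below for a single missing vector. Several details are assumptions made only so that the sketch is self-contained: an orthonormal DCT-II, nineteen filterbank bands, cepstral coefficients above those transmitted treated as zero in the inverse DCT, and a midpoint interpolation between the two neighbouring vectors. All function names are hypothetical.

```python
import math

N_BANDS = 19  # number of filterbank features, as in the description


def _scale(k, n):
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)


def dct_ii(values, n_out):
    """First n_out coefficients of an orthonormal DCT-II."""
    n = len(values)
    return [_scale(k, n) * sum(v * math.cos(math.pi * k * (i + 0.5) / n)
                               for i, v in enumerate(values))
            for k in range(n_out)]


def inverse_dct(coefficients, n_bands=N_BANDS):
    """Inverse transform; coefficients not supplied are taken as zero."""
    return [sum(_scale(k, n_bands) * c
                * math.cos(math.pi * k * (i + 0.5) / n_bands)
                for k, c in enumerate(coefficients))
            for i in range(n_bands)]


def to_filterbank(mfcc):
    """Cepstral domain to spectral domain: inverse DCT then exponentiation."""
    return [math.exp(v) for v in inverse_dct(mfcc)]


def to_mfcc(filterbank, n_coefficients):
    """Spectral domain to cepstral domain: logarithm then DCT."""
    return dct_ii([math.log(e) for e in filterbank], n_coefficients)


def estimate_via_spectral_domain(mfcc_before, mfcc_after):
    """Interpolate one missing vector midway between two neighbours."""
    fb_before = to_filterbank(mfcc_before)
    fb_after = to_filterbank(mfcc_after)
    fb_estimate = [(b + a) / 2 for b, a in zip(fb_before, fb_after)]
    return to_mfcc(fb_estimate, len(mfcc_before))


quiet = [1.0] + [0.0] * 8
loud = [3.0] + [0.0] * 8
estimate = estimate_via_spectral_domain(quiet, loud)
```

Because the interpolation acts on filterbank values rather than on their logarithms, the estimate differs from a direct cepstral interpolation: in this example the zeroth coefficient of `estimate` is larger than the cepstral midpoint of 2.0.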

After the feature vector sequence has been reconstructed by the feature vector regenerator 214, processing of the basic features prior to recognition is performed by the feature processor 212. RASTA filtering is applied by bandpass filtering the time series of feature vectors. A detailed description of RASTA filtering may be found in H. Hermansky and N. Morgan, “RASTA processing of speech”, IEEE Trans. Speech and Audio Proc., vol. 2, no. 4, pp. 578–589, October 1994. Any channel distortion is additive in the cepstral domain, so applying a sharp cut-off highpass filter to each of the features, across time, removes any offset and hence suppresses channel distortion. Cepstral Time Matrix (CTM) features are then calculated by taking a DCT across a sequence of seven MFCCs. A detailed description of CTM features may be found in B. P. Milner, “Inclusion of temporal information into features for speech recognition”, Proc. ICSLP, pp. 256–259, 1996.
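The effect of highpass filtering a feature across time may be illustrated with a simple one-pole filter. The filter below is not the RASTA filter of the cited reference and does not have a sharp cut-off; it is a generic offset-removing filter, y[n] = x[n] − x[n−1] + 0.9·y[n−1], chosen only to show that a constant additive offset, such as a fixed channel term in the cepstral domain, decays to zero at the output. The name `highpass` and the pole value are assumptions.

```python
import math


def highpass(series, pole=0.9):
    """One-pole offset-removing filter applied to one feature over time."""
    out = []
    prev_x = 0.0  # assumption: zero initial conditions
    prev_y = 0.0
    for x in series:
        y = x - prev_x + pole * prev_y
        out.append(y)
        prev_x, prev_y = x, y
    return out


# The same feature trajectory with and without a constant channel offset.
clean = [math.sin(0.5 * n) for n in range(300)]
offset = 4.0
distorted = [c + offset for c in clean]

# After the initial transient the two filtered trajectories coincide.
difference = [a - b for a, b in zip(highpass(distorted), highpass(clean))]
```

In this sketch `difference` starts at the offset value and shrinks geometrically, so the filtered features become independent of the offset.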

It will be appreciated by those skilled in the art that the technique described could be applied to other types of basic speech parameterisation. Cepstral features may be calculated using a Fourier transform as described here, or using linear predictive (LP) analysis. It can be proven that the resultant cepstrum from either of these two routes is identical. In the embodiment of the invention described here the Fourier transform based cepstrum has been modified to include a mel-scale filterbank resulting in MFCC's.

A process similar to the mel-scale filterbank is used in perceptual linear predictive (PLP) analysis where a set of critical-band filters is convolved with the speech spectrum. These modify the spectrum according to perceptual measurements of human hearing and lead to the PLP cepstrum.

It will also be appreciated by those skilled in the art that other feature vector processing techniques could be applied, for example differential features may be calculated, such as ‘velocity’ and ‘acceleration’ of the basic features. Cepstral mean normalisation in which the average of each feature is subtracted from each feature respectively, may be used. Linear discriminant analysis (LDA) as described in E. J. Paris and M. J. Carey, “Estimating linear discriminant parameters for continuous density HMMs”, Proc. ICSLP, pp. 215–218, 1994 may also be used.
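Two of the processing techniques mentioned above may be sketched briefly. Cepstral mean normalisation subtracts the time average of each feature from that feature; a ‘velocity’ feature is taken here as a central difference between neighbouring vectors. The central difference, the repetition of the first and last vectors at the edges, and the function names are assumptions made for illustration.

```python
def cepstral_mean_normalise(vectors):
    """Subtract the average of each feature, taken over time, from it."""
    n = len(vectors)
    means = [sum(column) / n for column in zip(*vectors)]
    return [[v - m for v, m in zip(vector, means)] for vector in vectors]


def velocity(vectors):
    """Central difference across time for each feature.

    Assumption: the first and last vectors are repeated at the edges.
    """
    padded = [vectors[0]] + vectors + [vectors[-1]]
    return [[(nxt - prv) / 2 for prv, nxt in zip(padded[t], padded[t + 2])]
            for t in range(len(vectors))]


sequence = [[1.0, 10.0], [2.0, 10.0], [3.0, 10.0]]
normalised = cepstral_mean_normalise(sequence)
deltas = velocity(sequence)
```

In this example the constant second feature is removed entirely by the normalisation and has zero velocity, while the first feature, which rises steadily, has a velocity of 1.0 at the middle vector.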

As will be understood by those skilled in the art, the speech recognition program 109 can be contained on various transmission and/or storage mediums such as a floppy disc, CD-ROM, or magnetic tape so that the program can be loaded onto one or more general purpose computers or could be downloaded over a computer network using a suitable transmission medium.

Unless the context clearly requires otherwise, throughout the description and the claims, the words “comprise”, “comprising” and the like are to be construed in an inclusive as opposed to an exclusive or exhaustive sense; that is to say, in the sense of “including, but not limited to”.

1. A method of speech recognition comprising: receiving a sequence of transmitted feature vectors, said feature vectors representing a speech signal and comprising features in a parameterized domain determined by processing basic features of the speech signal; detecting the absence of a feature vector in the received sequence; generating an estimated replacement feature vector for the detected absent feature vector by converting a received differential feature vector to a spectral domain, estimating a spectral component of said feature vector by interpolating the corresponding component of the converted feature vector, and converting the estimated spectral component to said parameterized domain of the frequencies of the speech signal; inserting said replacement feature vector into the received feature vector sequence to provide a modified feature vector sequence; and performing speech recognition upon the modified feature vector sequence.
 2. The method as in claim 1, wherein said parameterized domain comprises a cepstral domain.
 3. The method as in claim 1, wherein said parameterized domain comprises a domain determined using a feature vector processing technique which calculates a differential of said transmitted feature vectors.
 4. The method as in claim 1, in which said estimating the spectral component of said feature vector uses an interpolation coefficient corresponding to a spectral component of the received feature vector and further comprises updating the interpolation coefficient in response according to one or more received feature vectors.
 5. The method as in claim 1, in which a received feature vector includes an additional feature which indicates the position of each feature vector in the sequence of transmitted feature vectors.
 6. The method as in claim 5, wherein in said detecting the absence of a feature vector, a feature vector is determined to be missing from the received sequence of feature vectors by checking the feature vector number of each feature vector received.
 7. The method as in claim 6, wherein in said sequence of transmitted feature vectors, said generating an estimated replacement feature vector is performed separately for each missing feature vector.
 8. The method as in claim 6, wherein in said estimating a spectral component of a replacement feature vector, the corresponding component of the converted feature vector is determined by interpolating a polynomial formed from a time series of said feature vector.
 9. A device for performing speech recognition upon a sequence of feature vectors representing a speech signal, wherein said feature vectors comprise features in a parameterized domain determined by processing basic features of the speech signal, the device comprising: a missing feature vector detector arranged in operation to receive the transmitted feature vectors and to indicate the absence of a feature vector in the received sequence; a feature vector estimator arranged, in operation, to receive transmitted feature vectors and responsive to said indication from the missing feature vector detector to estimate a replacement feature vector, wherein said feature vector estimator comprises: a first converter for converting a received feature vector of said parameterized domain to a spectral domain, an estimator for estimating a spectral component by interpolating the corresponding component of the converted frame, and a second converter for converting the estimated spectral component to said parameterized domain; a sequence reconstructer arranged, in operation, to receive transmitted feature vectors and to receive a replacement feature vector and to provide as an output a modified feature vector sequence; and a speech recognizer arranged, in operation, to receive the modified feature vector sequence.
 10. The device as in claim 9, wherein said parameterized domain comprises a cepstral domain.
 11. The device as in claim 9, wherein said parameterized domain comprises a domain determined using a feature vector processing technique which calculates a differential of said transmitted feature vectors.
 12. The device as in claim 9, in which the interpolating uses an interpolation coefficient corresponding to a component of the received feature vector and the interpolation coefficient is updated in response to receipt of a feature vector.
 13. A data carrier loadable into and readable by a computer, and carrying instructions for causing the computer to carry out the method according to claim 1.
 14. A data carrier loadable into and readable by a computer, and carrying instructions for enabling the computer to provide the device according to claim 9.